Recent advances in deep learning for object detection
Object detection is a fundamental visual recognition problem in computer
vision and has been widely studied in the past decades. Visual object detection
aims to find objects of certain target classes with precise localization in a
given image and assign each object instance a corresponding class label. Due to
the tremendous successes of deep learning based image classification, object
detection techniques using deep learning have been actively studied in recent
years. In this paper, we give a comprehensive survey of recent advances in
visual object detection with deep learning. By reviewing a large body of recent
related work in literature, we systematically analyze the existing object
detection frameworks and organize the survey into three major parts: (i)
detection components, (ii) learning strategies, and (iii) applications &
benchmarks. In the survey, we cover a variety of factors affecting the
detection performance in detail, such as detector architectures, feature
learning, proposal generation, sampling strategies, etc. Finally, we discuss
several future directions to facilitate and spur future research for visual
object detection with deep learning. Keywords: Object Detection, Deep Learning,
Deep Convolutional Neural Network
OTW: Optimal Transport Warping for Time Series
Dynamic Time Warping (DTW) has become the pragmatic choice for measuring
distance between time series. However, it suffers from unavoidable quadratic
time complexity when the optimal alignment matrix needs to be computed exactly.
This hinders its use in deep learning architectures, where layers involving DTW
computations cause severe bottlenecks. To alleviate these issues, we introduce
a new metric for time series data based on the Optimal Transport (OT)
framework, called Optimal Transport Warping (OTW). OTW enjoys linear time/space
complexity, is differentiable and can be parallelized. OTW enjoys a moderate
sensitivity to time and shape distortions, making it ideal for time series. We
show the efficacy and efficiency of OTW on 1-Nearest Neighbor Classification
and Hierarchical Clustering, as well as in the case of using OTW instead of DTW
in Deep Learning architectures.
Comment: This is an extended version of an ICASSP 2023 accepted paper
https://ieeexplore.ieee.org/document/1009591
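To illustrate the quadratic cost that the abstract attributes to exact DTW, here is a minimal sketch of the classic dynamic-programming formulation. It is not the OTW method and is not taken from the paper; the function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for illustration only.

```python
import math


def dtw_distance(a, b):
    """Exact DTW between two 1-D sequences (illustrative sketch).

    Assumption: the local cost is the absolute difference |a_i - b_j|.
    The nested loops fill an (n+1) x (m+1) table, which is the source of
    the O(n*m) time and space cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched with an earlier b
                cost[i][j - 1],      # b[j-1] also matched with an earlier a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

For two series of lengths n and m, the table has (n+1)(m+1) cells, each filled in constant time, so a layer that evaluates this exactly for every pair in a batch can plausibly become the kind of bottleneck the abstract describes.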
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is
conducted based on visual and audio aspects of a given video, is significantly
more challenging than traditional image or text-grounded dialogue systems
because (1) the feature space of videos spans multiple picture frames, making
it difficult to obtain semantic information; and (2) a dialogue agent must
perceive and process information from different modalities (audio, video,
caption, etc.) to obtain a comprehensive understanding. Most existing work is
based on RNNs and sequence-to-sequence architectures, which are not very
effective for capturing complex long-term dependencies (such as those in videos). To
overcome this, we propose Multimodal Transformer Networks (MTN) to encode
videos and incorporate information from different modalities. We also propose
query-aware attention through an auto-encoder to extract query-aware features
from non-text modalities. We develop a training procedure to simulate
token-level decoding to improve the quality of generated responses during
inference. We achieve state-of-the-art performance on Dialogue System Technology
Challenge 7 (DSTC7). Our model also generalizes to another multimodal
visual-grounded dialogue task, and obtains promising performance. We
implemented our models using PyTorch and the code is released at
https://github.com/henryhungle/MTN.
Comment: Accepted at ACL 2019 (Long Paper)